---
title: Build a recipe
description: DataRobot leverages the compute environment and distributed architecture of your data source to quickly perform exploratory data analysis and apply transformations as you build your recipe.

---

# Build a recipe {: #build-a-recipe }

Building a recipe is the first step in preparing your data. When you start a Wrangle session, DataRobot connects to your data source, pulls a live random sample, and performs exploratory data analysis on that sample. When you add operations to your recipe, the transformation is applied to the sample and the exploratory data insights are recalculated, allowing you to quickly iterate on and profile your data before publishing.

See the associated [considerations](wb-data-ref/index#wrangle-data) for important additional information. For a complete list of available connections in Workbench and which features they support, see the [connection capabilities table](wb-data-ref/index#connection-capabilities).


!!! warning "Wrangling requirement"
    To wrangle data, you must [add a dataset using a configured data connection](wb-connect). 

??? note "Operation behavior"
    When a wrangling recipe is pushed down to the connected cloud data platform, the operations are executed in that platform's environment. To understand how operations behave, refer to the documentation for your data platform:
    
      - [Snowflake documentation](https://docs.snowflake.com/en/sql-reference-functions){ target=_blank }
      - [BigQuery documentation](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators){ target=_blank }


    Once the dataset is materialized in DataRobot and added to your Use Case, you can go to the **Datasets** tab and [view which queries were executed](wb-data-tab#view-wrangling-recipe-sql) by the cloud data platform during pushdown.

!!! info "Public preview"
     The ability to wrangle data from Azure Databricks is on by default, as are the associated feature flags.

     <b>Feature flag(s):</b>

     - Enable Databricks Driver
     - Enable Databricks Wrangling
     - Enable Dynamic Datasets in Workbench
     
## Configure the live sample {: #configure-the-live-sample }

By default, DataRobot retrieves 10000 rows for the live sample; however, you can modify this number in the wrangling settings. Note that the more rows you retrieve, the longer it takes to render the live sample.

To configure the live sample:

1. Click **Settings** in the right panel and open **Interactive sample**.

    ![](images/wb-operation-1.png)

2. Enter the number of rows (under 10000) you want to include in the live sample and click **Resample**. The live sample updates to display the specified number of rows.

    ![](images/wb-operation-2.png)

## Analyze the live sample {: #analyze-the-live-sample }

During data wrangling, DataRobot performs exploratory data analysis on the live sample, generating table- and column-level [summary statistics](histogram){ target=_blank } and [visualizations](histogram#histogram-chart){ target=_blank } that help you profile the dataset and recognize data quality issues as you apply operations. For more information on interacting with the live sample, see the section on [exploratory data insights](wb-data-tab#view-exploratory-data-insights).

![](images/wb-operation-13.png)
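
To make the idea of profiling a sample concrete, the following is a minimal Python sketch, not DataRobot's implementation. It draws a random sample from an in-memory table and computes a few column-level statistics. The helper names (`draw_sample`, `profile_column`), the `source` rows, and the chosen statistics are all hypothetical.

```python
import random
import statistics


def draw_sample(rows, n, seed=0):
    """Return up to n rows chosen uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


def profile_column(rows, column):
    """Compute simple column-level statistics, skipping missing values."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "unique": len(set(values)),
        "min": min(values) if values else None,
        "max": max(values) if values else None,
        "mean": statistics.fmean(values) if values else None,
    }


# Hypothetical source table: 1000 rows with an occasional missing age.
source = [{"id": i, "age": None if i % 50 == 0 else 20 + i % 60} for i in range(1000)]

sample = draw_sample(source, 100)
stats = profile_column(sample, "age")
print(stats)
```

Profiling a smaller sample touches fewer rows, which is the same trade-off behind choosing the live sample size.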

??? tip "Speed up live sample"
    To reduce the time it takes to retrieve and render the live sample, use the toggle next to **Show Insights** to hide the feature distribution charts.

??? faq "Live sample vs. exploratory data insights on the Datasets tab"
    Although both pages provide similar insights, the live sample lets you specify the number of rows displayed, and it updates each time a transformation is added to your recipe.

## Add operations {: #add-operations }

A recipe is composed of operations&mdash;transformations that will be applied to the source data to prepare it for modeling. Note that operations are applied sequentially, so you may need to [reorder the operations](#reorder-operations) in your recipe to achieve the desired result.

The table below describes the wrangling operations currently available in Workbench:

Operation | Description
--------- | -----------
[Join](#join) (public preview)   |  Join datasets that are accessible via the same connection instance.
[Aggregate](#aggregate) (public preview)   | Apply mathematical aggregations to features in your dataset.
[Compute new feature](#compute-a-new-feature) | Create a new feature using the scalar subqueries, scalar functions, or window functions of your cloud data platform.
[Filter row](#filter-row) | Filter the rows in your dataset according to specified value(s) and conditions.
[De-duplicate rows](#de-duplicate-row) | Automatically remove all duplicate rows from your dataset.
[Find and replace](#find-and-replace) | Replace specific feature values in a dataset.
[Rename features](#rename-features) | Change the name of one or more features in your dataset.
[Remove features](#remove-features) | Remove one or more features from your dataset.

??? faq "Can I perform majority class downsampling for unbalanced datasets?"
    Yes, you can enable majority class downsampling during the [publishing phase](wb-pub-recipe#configure-smart-downsampling) of wrangling. In Workbench, downsampling happens in-source and sampling weights are generated. The target and weights are then passed along to the experiment.

To add an operation to your recipe:

1. With **Recipe** selected, click **Add Operation** in the right panel.

    ![](images/wb-operation-12.png)

2. Select and configure an operation. Then, click **Add to recipe**.

    The live sample updates after DataRobot retrieves a new sample from the data source and applies the operation, allowing you to review the transformation in real time.

3. Continue adding operations while analyzing their effect on the live sample; when you're done, the [recipe is ready to be published](wb-pub-recipe).

    ![](images/wb-operation-11.png)

### Join {: #join }

!!! info "Public preview"
     The Join operation is on by default. 

     <b>Feature flag:</b> Enables Additional Wrangler Operations

Use the **Join** operation to combine datasets that are accessible via the same connection instance.

To join a table or dataset:

1. Click **Join** in the right panel.

    ![](images/wb-join-1.png)

2. Click **+ Select dataset** to browse and select a dataset from your connection instance.

    ![](images/wb-join-2.png)

3. Once you've opened and profiled the dataset you want to add, click **Select**.

    ![](images/wb-join-3.png)

4. Select the appropriate **Join type** from the dropdown. 

    - **Inner** only returns rows that have matching values in both datasets, for example, any rows with matching values in the `order_id` column.
    - **Left** returns all rows from the left dataset (the original), and only the rows with matching values in the right dataset (joined).

    ![](images/wb-join-5.png)

5. Select the **Join condition**, which defines how the two datasets are related. In this example, the datasets are related by `order_id`.

    ![](images/wb-join-6.png)

6. Click **Add to recipe**.
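
The difference between the two join types can be sketched in plain Python. This is an illustration only; in a pushed-down recipe the join is executed by the cloud data platform. The `join` helper and the `orders` and `shipments` rows are hypothetical.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dict rows on `key`; `how` is "inner" or "left"."""
    if how not in ("inner", "left"):
        raise ValueError(f"unsupported join type: {how}")

    # Index right-hand rows by key so each left row can look up its matches.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    right_columns = {col for row in right for col in row if col != key}

    result = []
    for row in left:
        matches = index.get(row[key], [])
        if matches:
            result.extend({**row, **match} for match in matches)
        elif how == "left":
            # Keep the unmatched left row; right-hand columns are left empty.
            result.append({**row, **{col: None for col in right_columns}})
    return result


# Hypothetical datasets related by order_id.
orders = [
    {"order_id": 1, "amount": 20},
    {"order_id": 2, "amount": 35},
    {"order_id": 3, "amount": 15},
]
shipments = [
    {"order_id": 1, "carrier": "A"},
    {"order_id": 3, "carrier": "B"},
]

inner_rows = join(orders, shipments, "order_id", how="inner")
left_rows = join(orders, shipments, "order_id", how="left")
print(len(inner_rows), len(left_rows))
```

Here the inner join drops order 2 because it has no shipment, while the left join keeps it with an empty `carrier` value.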

### Aggregate {: #aggregate }

!!! info "Public preview"
     The Aggregate operation is on by default. 

     <b>Feature flag:</b> Enables Additional Wrangler Operations

Use the **Aggregate** operation to apply the following mathematical aggregations to the dataset (available aggregations vary by feature type):

- Sum
- Min
- Max
- Avg
- Standard deviation
- Count
- Count distinct
- Most frequent (Snowflake only)

To add an aggregation:

1. Click **Aggregate** in the right panel.

    ![](images/wb-aggregate-1.png)

2. Under **Group by key**, select the feature(s) you want to group your aggregation(s) by.

    ![](images/wb-aggregate-2.png)

3. Click the field below **Feature to aggregate** and select a feature from the dropdown. Then, click the field below **Aggregate function** and choose one or more aggregations to apply to the feature.

    ![](images/wb-aggregate-3.png)

4. (Optional) Click **+ Add feature** to apply aggregations to additional features in this grouping.

5. Click **Add to recipe**.

    After adding the operation to the recipe, DataRobot renames aggregated features using the original name with the `_AggregationFunction` suffix attached. In this example, the new columns are `age_max` and `age_most_frequent`.

    ![](images/wb-aggregate-4.png)
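
As a rough sketch of the grouping and naming behavior described above, the following Python groups rows by a key and names each output column with the feature name plus the aggregation as a suffix. The `aggregate` helper, the `AGGREGATIONS` mapping, and the `patients` rows are hypothetical and are not the SQL that DataRobot generates.

```python
from collections import Counter, defaultdict

# Hypothetical subset of aggregations, keyed by the suffix used in output names.
AGGREGATIONS = {
    "sum": sum,
    "min": min,
    "max": max,
    "count": len,
    "most_frequent": lambda values: Counter(values).most_common(1)[0][0],
}


def aggregate(rows, group_by, feature, functions):
    """Group rows by `group_by` and apply each named aggregation to `feature`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[feature])

    result = []
    for key, values in groups.items():
        out = {group_by: key}
        for name in functions:
            # Output column is named <feature>_<aggregation>, e.g. age_max.
            out[f"{feature}_{name}"] = AGGREGATIONS[name](values)
        result.append(out)
    return result


patients = [
    {"gender": "F", "age": 34},
    {"gender": "F", "age": 34},
    {"gender": "F", "age": 61},
    {"gender": "M", "age": 45},
    {"gender": "M", "age": 52},
    {"gender": "M", "age": 52},
]

summary = aggregate(patients, "gender", "age", ["max", "most_frequent"])
print(summary)
```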

### Compute a new feature {: #compute-a-new-feature }

Use the **Compute new feature** operation to create a new output feature from existing features in your dataset. By applying domain knowledge, you can create features that do a better job of representing your business problem to the model than those in the original dataset.

To compute a new feature:

1. Click **Compute new feature** in the right panel.

    ![](images/wb-operation-10.png)

2. Enter a name for the new feature, and under **Expression**, define the feature using scalar subqueries, scalar functions, or window functions for your chosen cloud data platform:

    === "Snowflake"

        See the Snowflake documentation for:

        - [Scalar subqueries](https://docs.snowflake.com/en/user-guide/querying-subqueries#scalar-subqueries){ target=_blank }
        - [Scalar functions](https://docs.snowflake.com/en/sql-reference/functions){ target=_blank }
        - [Window functions](https://docs.snowflake.com/en/sql-reference/functions-analytic){ target=_blank }

    === "BigQuery"

        See the BigQuery documentation for:

        - [Scalar subqueries](https://cloud.google.com/bigquery/docs/reference/standard-sql/subqueries#scalar_subquery_concepts){ target=_blank }
        - [Scalar functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators){ target=_blank }
        - [Window functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls){ target=_blank }

    === "Databricks"

        See the Databricks documentation for:

        - [Scalar subqueries](https://docs.databricks.com/en/sql/language-manual/sql-ref-expression.html#scalar-subquery){ target=_blank }
        - [Scalar functions](https://docs.databricks.com/en/sql/language-manual/sql-ref-functions-builtin.html){ target=_blank }
        - [Window functions](https://docs.databricks.com/en/sql/language-manual/sql-ref-window-functions.html){ target=_blank }
  
    ![](images/wb-operation-14.png)

    This example uses `REGEXP_SUBSTR` to extract the first number from the `[<age_range_start> - <age_range_end>)` value in the `age` column, and `to_number` to convert the output from a string to a number.

3. Click **Add to recipe**.
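
The logic of that expression can be approximated in Python for readers who want to test the pattern locally. This is an analogue, not the SQL entered in the **Expression** field. The `age_range_start` helper and the `age_start` feature name are hypothetical.

```python
import re


def age_range_start(age_range):
    """Return the first number in a value such as "[40 - 50)", or None if absent."""
    match = re.search(r"\d+", age_range)  # first run of digits, like REGEXP_SUBSTR
    return int(match.group()) if match else None  # string to number, like to_number


rows = [{"age": "[40 - 50)"}, {"age": "[0 - 10)"}, {"age": "unknown"}]
for row in rows:
    row["age_start"] = age_range_start(row["age"])

print(rows)
```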

### Filter row {: #filter-row }

Use the **Filter row** operation to filter the rows in your dataset according to specified value(s) and conditions.

To filter rows:

1. Click **Filter row** in the right panel.

    ![](images/wb-operation-8.png)

2. Decide if you want to keep the rows that match the defined conditions or exclude them.

3. Define the filter conditions by choosing the feature you want to filter, the condition type, and the value you want to filter by. DataRobot highlights the selected column.

    ![](images/wb-operation-7.png)

4. (Optional) Click **Add condition** to define additional filtering criteria.

5. Click **Add to recipe**.
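
The keep-or-exclude choice can be sketched as follows. This Python illustration assumes that a row matches only when every condition holds, which this page does not state, so verify how multiple conditions combine in your recipe. The `filter_rows` helper and the condition names are hypothetical.

```python
import operator

# Hypothetical condition types mapped to comparison functions.
CONDITIONS = {
    "equals": operator.eq,
    "not_equals": operator.ne,
    "greater_than": operator.gt,
    "less_than": operator.lt,
}


def filter_rows(rows, conditions, keep_matching=True):
    """Keep (or exclude) rows that satisfy every (feature, condition, value) triple."""

    def matches(row):
        # Assumption: conditions are combined with AND.
        return all(
            CONDITIONS[condition](row[feature], value)
            for feature, condition, value in conditions
        )

    return [row for row in rows if matches(row) == keep_matching]


rows = [
    {"age": 25, "state": "NY"},
    {"age": 41, "state": "NY"},
    {"age": 67, "state": "CA"},
]
conditions = [("age", "greater_than", 30), ("state", "equals", "NY")]

kept = filter_rows(rows, conditions, keep_matching=True)
excluded = filter_rows(rows, conditions, keep_matching=False)
print(kept, excluded)
```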

### De-duplicate rows {: #de-duplicate-row }

Use the **De-duplicate rows** operation to automatically remove all rows with duplicate information from the dataset.

To de-duplicate rows, click **De-duplicate rows** in the right panel. This operation is immediately added to your recipe and applied to the live sample.

![](images/wb-operation-15.png)
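
A minimal Python sketch of de-duplication, assuming that two rows count as duplicates only when every column value is identical. The `deduplicate` helper and the sample rows are hypothetical.

```python
def deduplicate(rows):
    """Return rows with exact duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        # A hashable fingerprint of the whole row, independent of column order.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


rows = [
    {"order_id": 1, "amount": 20},
    {"order_id": 1, "amount": 20},
    {"order_id": 1, "amount": 25},
]

unique_rows = deduplicate(rows)
print(unique_rows)
```

Under this assumption, rows that share an `order_id` but differ in another column are both retained.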

### Find and replace {: #find-and-replace }

Use the **Find and replace** operation to quickly replace specific feature values in a dataset. For example, this is helpful for fixing typos in a dataset.

To find and replace a feature value:

1. Click **Find and replace** in the right panel.

    ![](images/wb-operation-9.png)

2. Under **Select feature**, click the dropdown and choose the feature that contains the value you want to replace. DataRobot highlights the selected column.

    ![](images/wb-operation-3.png)

3. Under **Find**, choose the match criteria&mdash;**Exact**, **Partial**, or **Regular Expression**&mdash;and enter the feature value you want to replace. Then, under **Replace**, enter the new value.

    ![](images/wb-operation-4.png)

4. Click **Add to recipe**.
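
The three match criteria can be compared with a short Python sketch. The semantics here are assumptions for illustration: exact replaces a value only when the whole value equals the search term, partial replaces each occurrence of the search term within a value, and regular expression replaces each pattern match. The `find_and_replace` helper and the `cities` values are hypothetical.

```python
import re


def find_and_replace(values, find, replace, match="exact"):
    """Replace values using the assumed semantics of each match criterion."""
    if match == "exact":
        return [replace if value == find else value for value in values]
    if match == "partial":
        return [value.replace(find, replace) for value in values]
    if match == "regex":
        pattern = re.compile(find)
        return [pattern.sub(replace, value) for value in values]
    raise ValueError(f"unsupported match criteria: {match}")


cities = ["New Yrok", "new york city", "Boston"]

exact = find_and_replace(cities, "New Yrok", "New York", match="exact")
partial = find_and_replace(cities, "york", "York", match="partial")
regex = find_and_replace(cities, r"(?i)^new y(or|ro)k", "New York", match="regex")
print(exact, partial, regex, sep="\n")
```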

### Rename features {: #rename-features }

Use the **Rename features** operation to rename one or more features in the dataset.

To rename features:

1. Click **Rename features** in the right panel.

    ![](images/wb-operation-16.png)

    ??? tip "Rename specific features from the live sample"
        Alternatively, you can click the **More options** icon next to the feature you want to rename. This opens the operation parameters in the right panel with the feature field already filled in.

        ![](images/wb-operation-21.png)

2. Under **Feature name**, click the dropdown and choose the feature you want to rename. Then, enter the new feature name in the second field.

    ![](images/wb-operation-18.png)

3. (Optional) Click **Add feature** to rename additional features.

4. Click **Add to recipe**.

### Remove features {: #remove-features }

Use the **Remove features** operation to remove features from the dataset.

To remove features:

1. Click **Remove features** in the right panel.

    ![](images/wb-operation-19.png)

    ??? tip "Remove specific features from the live sample"
        Alternatively, you can click the **More options** icon next to the feature you want to remove. This opens the operation parameters in the right panel with the feature field already filled in.

        ![](images/wb-operation-21.png)

2. Under **Feature name**, click the dropdown and either start typing the feature name or scroll through the list to select the feature(s) you want to remove. Click outside of the dropdown when you're done selecting features.

    ![](images/wb-operation-20.png)

3. Click **Add to recipe**.

## Reorder operations {: #reorder-operations }

All operations in a wrangling recipe are applied sequentially, so the order in which they appear affects the output dataset.

To move an operation to a new location, click and hold the operation you want to move, and then drag it to a new position.

![](images/wb-op-reorder.png)

The live sample updates to reflect the new order.
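
To see why order matters, consider this Python sketch in which a recipe is modeled as a list of functions applied one after another. It is an illustration under that assumption, not how DataRobot represents recipes. The `apply_recipe`, `remove_feature`, and `deduplicate` helpers are hypothetical.

```python
def deduplicate(rows):
    """Remove rows whose column values are all identical to an earlier row."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


def remove_feature(name):
    """Build an operation that drops the named feature from every row."""
    return lambda rows: [{k: v for k, v in row.items() if k != name} for row in rows]


def apply_recipe(rows, operations):
    """Apply each operation to the output of the previous one."""
    for operation in operations:
        rows = operation(rows)
    return rows


rows = [
    {"customer": "a", "visit": 1},
    {"customer": "a", "visit": 2},
    {"customer": "b", "visit": 1},
]

dedupe_first = apply_recipe(rows, [deduplicate, remove_feature("visit")])
remove_first = apply_recipe(rows, [remove_feature("visit"), deduplicate])
print(len(dedupe_first), len(remove_first))
```

De-duplicating first leaves three rows because each row is unique while `visit` is present. Removing `visit` first makes the two rows for customer `a` identical, so only two rows remain.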

## Quit wrangling {: #quit-wrangling }

At any point, you can click **Quit Wrangling** to end your wrangling session; however, any operations applied to the dataset will be removed.

![](images/wb-operation-quit.png)

## Next steps {: #next-steps }

From here, you can:

- [Publish the recipe to the data source, generating a new output dataset.](wb-pub-recipe)

## Read more {: #read-more}

To learn more about the topics discussed on this page, see:

- [Description of summary statistics and histograms in DataRobot Classic.](histogram){ target=_blank }
